Fast blitting.doc
C.S.M.P. Digest Tue, 19 Dec 95 Volume 3 : Issue 128
>From erichsen@pacificnet.net (Erichsen)
Subject: Doubles Vs BlockMove
Date: 16 Nov 1995 02:22:08 GMT
Organization: Disorganized
I did some tests (modifying the code in MoveData app from Tricks of the
Mac Game Programming Gurus) between using doubles in a loop and BlockMove
in a loop and BlockMove still blew it away (200 ticks vs 146 ticks for
BlockMove) so why don't more people use BlockMove?
I compared BlockMove vs BlockMoveData and found no difference at all (both
146 ticks). Does BlockMove not flush the cache on a 6100?
One of the replies to my previous question of why people don't just use
BlockMove instead of a copying loop was that the data is not necessarily a
block but, all the examples of blitters I've seen just copy one contiguous
block of memory to another contiguous block of memory. Why couldn't
BlockMove be used?
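As an illustration, here is a minimal C sketch of the kind of comparison
being described: a copy loop using doubles versus a single block-copy call.
It uses memcpy() and clock() as portable stand-ins for BlockMove and
TickCount, and the buffer size and repetition count are arbitrary
assumptions, not the original MoveData test.

/* Sketch: time a doubles copy loop against one library block copy.
   memcpy() stands in for BlockMove; clock() stands in for TickCount. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BYTES (1024 * 1024)   /* 1 MB per copy (assumed) */
#define REPS  100             /* number of copies timed (assumed) */

int main(void)
{
    double *src = malloc(BYTES);
    double *dst = malloc(BYTES);
    size_t  n   = BYTES / sizeof(double);
    clock_t t0, t1;
    size_t  i, r;

    if (!src || !dst) return 1;
    memset(src, 0x5A, BYTES);

    /* copy loop using doubles, as in the MoveData-style test */
    t0 = clock();
    for (r = 0; r < REPS; r++)
        for (i = 0; i < n; i++)
            dst[i] = src[i];
    t1 = clock();
    printf("double loop: %ld\n", (long)(t1 - t0));

    /* single block-copy call per pass (BlockMove on the Mac) */
    t0 = clock();
    for (r = 0; r < REPS; r++)
        memcpy(dst, src, BYTES);
    t1 = clock();
    printf("block copy:  %ld\n", (long)(t1 - t0));

    free(src);
    free(dst);
    return 0;
}

On a real Mac the memcpy() calls would be BlockMove/BlockMoveData and the
timing would use TickCount or Microseconds.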
+++++++++++++++++++++++++++
>From cameron_esfahani@powertalk.apple.com (Cameron Esfahani)
Date: Mon, 20 Nov 1995 11:55:46 -0800
Organization: Apple Computer, Inc.
BlockMove/BlockMoveData on the first-generation PPC are exactly the same
function. The reason BlockMoveData was created in the first place was so
you could tell the system you were not moving code around and that it did
not need to flush the instruction cache. Since the 601 has a unified cache,
there is no cache-coherency issue to worry about, so you don't have to
flush the processor cache.
The reason most people don't use BlockMove/BlockMoveData as a blitter is
that it will be very very slow if you ever use the screen as the
destination. The reason is that the BlockMove/BlockMoveData routines use
the PPC instruction DCBZ. This instruction will cause a data-exception
fault if the address supplied is not copy-back cacheable. The screen
isn't marked copy-back cacheable.
Hope this helps,
Cameron Esfahani
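As a concrete sketch of the policy this implies: use the system block copy
only when the destination is ordinary, copy-back-cacheable RAM, and fall
back to a plain loop when blitting to the screen. The function name and the
dest_is_vram flag below are illustrative only, and memcpy() stands in for
BlockMoveData.

/* Sketch: choose the copy strategy based on whether the destination is
   uncacheable video memory (where dcbz-based copies must be avoided). */
#include <string.h>

void copy_to_screen_safe(void *dst, const void *src, size_t n, int dest_is_vram)
{
    if (!dest_is_vram) {
        memcpy(dst, src, n);   /* BlockMoveData on the Mac */
    } else {
        /* plain word-at-a-time loop: no dcbz, so safe for uncacheable VRAM */
        const unsigned long *s = src;
        unsigned long       *d = dst;
        size_t i, words = n / sizeof(unsigned long);
        for (i = 0; i < words; i++)
            d[i] = s[i];
    }
}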
********
>From nporcino@sol.uvic.ca (Nick Porcino)
Date: 20 Nov 1995 20:35:30 GMT
Organization: Planet IX
We did some tests and found on a Q700 that BlockMoveData was faster than
BlockMove in the context of an actual game (Riddle of Master Lu)
- Nick Porcino
Lead Engine Guy
Sanctuary Woods
+++++++++++++++++++++++++++
>From meggs@virginia.edu (Andrew Meggs)
Date: Tue, 21 Nov 1995 02:55:08 GMT
Organization: University of Virginia
In article <erichsen-1511951722510001@pm2-3.pacificnet.net>,
erichsen@pacificnet.net (Erichsen) wrote:
> I did some tests (modifying the code in MoveData app from Tricks of the
> Mac Game Programming Gurus) between using doubles in a loop and BlockMove
> in a loop and BlockMove still blew it away (200 ticks vs 146 ticks for
> BlockMove) so why don't more people use BlockMove?
>
This got me interested, so I went and disassembled BlockMove. Surprisingly,
they aren't using doubles:
BlockMove
+00060 40A1C558 lwz r5,0x0000(r3)
+00064 40A1C55C lwz r6,0x0004(r3)
+00068 40A1C560 lwz r7,0x0008(r3)
+0006C 40A1C564 lwz r8,0x000C(r3)
+00070 40A1C568 lwz r9,0x0010(r3)
+00074 40A1C56C lwz r10,0x0014(r3)
+00078 40A1C570 lwz r11,0x0018(r3)
+0007C 40A1C574 lwz r12,0x001C(r3)
+00080 40A1C578 dcbz 0,r4
+00084 40A1C57C addi r3,r3,0x0020
+00088 40A1C580 dcbt 0,r3
+0008C 40A1C584 stw r5,0x0000(r4)
+00090 40A1C588 stw r6,0x0004(r4)
+00094 40A1C58C stw r7,0x0008(r4)
+00098 40A1C590 stw r8,0x000C(r4)
+0009C 40A1C594 stw r9,0x0010(r4)
+000A0 40A1C598 stw r10,0x0014(r4)
+000A4 40A1C59C stw r11,0x0018(r4)
+000A8 40A1C5A0 stw r12,0x001C(r4)
+000AC 40A1C5A4 addi r4,r4,0x0020
+000B0 40A1C5A8 bdnz BlockMove+00060
The performance win is in the dcbz/dcbt pair. I'm assuming you weren't
copying to video memory, because that's marked uncacheable, and dcbz will
severely hurt performance if your destination is uncacheable.
I probably would have written it more like this, personally. Does anyone
have any idea what makes Apple's better? (Assuming it is...)
;assume source, destination, and size are all 32-byte aligned
;set r3 to source address minus 8 and r4 to destination address minus 8
;set ctr to size >> 5
BlockMoveLoop
lfd fp0,8(r3)
lfd fp1,16(r3)
lfd fp2,24(r3)
lfdu fp3,32(r3)
dcbz 0,r4
dcbt 0,r3
stfd fp0,8(r4)
stfd fp1,16(r4)
stfd fp2,24(r4)
stfdu fp3,32(r4)
bdnz BlockMoveLoop
> I compared BlockMove vs BlockMoveData and found no difference at all (both
> 146 ticks). Does BlockMove not flush the cache on a 6100?
>
The unified instruction and data cache on the 601 means treating code as
data causes no problems, so there's no need to maintain coherency between
two separate caches. In other words, BlockMove shouldn't need to flush the
cache on a 6100, but on a 604 it would need to.
--
_________________________________________________________________________
andrew meggs the one who dies with the most
meggs@virginia.edu AOL free trial disks wins
_________________________________________________________________________
dead tv software --==-- the next generation of 3D games for the macintosh
<http://darwin.clas.virginia.edu/~apm3g/deadtv/index.html>
+++++++++++++++++++++++++++
>From Mark Williams <Mark@streetly.demon.co.uk>
Date: Wed, 22 Nov 95 09:42:32 GMT
Organization: Streetly Software
In article <meggs-2011952155080001@bootp-188-82.bootp.virginia.edu>, Andrew Meggs writes:
>
> In article <erichsen-1511951722510001@pm2-3.pacificnet.net>,
> erichsen@pacificnet.net (Erichsen) wrote:
>
> > I did some tests (modifying the code in MoveData app from Tricks of the
> > Mac Game Programming Gurus) between using doubles in a loop and BlockMove
> > in a loop and BlockMove still blew it away (200 ticks vs 146 ticks for
> > BlockMove) so why don't more people use BlockMove?
> >
>
> This got me interested, so I went and disassembled BlockMove. Surprisingly,
> they aren't using doubles:
>
> BlockMove
> +00060 40A1C558 lwz r5,0x0000(r3)
> +00064 40A1C55C lwz r6,0x0004(r3)
> +00068 40A1C560 lwz r7,0x0008(r3)
> +0006C 40A1C564 lwz r8,0x000C(r3)
> +00070 40A1C568 lwz r9,0x0010(r3)
> +00074 40A1C56C lwz r10,0x0014(r3)
> +00078 40A1C570 lwz r11,0x0018(r3)
> +0007C 40A1C574 lwz r12,0x001C(r3)
> +00080 40A1C578 dcbz 0,r4
> +00084 40A1C57C addi r3,r3,0x0020
> +00088 40A1C580 dcbt 0,r3
> +0008C 40A1C584 stw r5,0x0000(r4)
> +00090 40A1C588 stw r6,0x0004(r4)
> +00094 40A1C58C stw r7,0x0008(r4)
> +00098 40A1C590 stw r8,0x000C(r4)
> +0009C 40A1C594 stw r9,0x0010(r4)
> +000A0 40A1C598 stw r10,0x0014(r4)
> +000A4 40A1C59C stw r11,0x0018(r4)
> +000A8 40A1C5A0 stw r12,0x001C(r4)
> +000AC 40A1C5A4 addi r4,r4,0x0020
> +000B0 40A1C5A8 bdnz BlockMove+00060
>
>
> The performance win is in the dcbz/dcbt pair. I'm assuming you weren't
> copying to video memory, because that's marked uncacheable, and dcbz will
> severely hurt performance if your destination is uncacheable.
>
> I probably would have written it more like this, personally. Does anyone
> have any idea what makes Apple's better? (Assuming it is...)
Consecutive stfd's stall both pipelines. This means that (assuming all
cache hits) you get one fp store every 3 cycles, compared with one integer
store every cycle. The result is 12 cycles to transfer 4 words using fp
registers, but only 10 cycles using integer registers (see page I-175 of
the 601 User's Manual).
> ;assume source, destination, and size are all 32-byte aligned
> ;set r3 to source address minus 8 and r4 to destination address minus 8
> ;set ctr to size >> 5
>
> BlockMoveLoop
> lfd fp0,8(r3)
> lfd fp1,16(r3)
> lfd fp2,24(r3)
> lfdu fp3,32(r3)
> dcbz 0,r4
> dcbt 0,r3
> stfd fp0,8(r4)
> stfd fp1,16(r4)
> stfd fp2,24(r4)
> stfdu fp3,32(r4)
> bdnz BlockMoveLoop
>
One other problem with your code (and presumably why Apple uses the
apparently wasteful addi instructions rather than load/store with update)
is that your dcbt instruction comes too late... fp3 already contains the
double at r3 by the time you hit the dcbt 0,r3 instruction, so it has no
effect. Much worse, the dcbz always touches the block you wrote the
_previous_ time through the loop...
This could easily be fixed by preloading r5 with 8 and writing
dcbz r5,r4
dcbt r5,r3
But you would still lose out on a 601. I _think_ it would be quicker on a
604, but I've not checked.
- --------------------------------------
Mark Williams<Mark@streetly.demon.co.uk>
+++++++++++++++++++++++++++
>From cameron_esfahani@powertalk.apple.com (Cameron Esfahani)
Date: Tue, 28 Nov 1995 01:24:06 -0800
Organization: Apple Computer, Inc.
BlockMoveData was introduced with System 7.5. The code for
it was kicking around Apple for a little while before we had a shipping
vehicle for it.
Cameron Esfahani
+++++++++++++++++++++++++++
>From deirdre@deeny.mv.com (Deirdre)
Date: Tue, 28 Nov 1995 14:46:04 GMT
Organization: Tarla's Secret Clench
BlockMove was available in System 1.0. However, the distinction between
BlockMove and the newer call BlockMoveData is only significant on 040s and
higher. On other machines it is the same trap.
_Deirdre
+++++++++++++++++++++++++++
>From kenp@nmrfam.wisc.edu (Ken Prehoda)
Date: Wed, 29 Nov 1995 09:26:05 -0600
Organization: Univ of Wisconsin-Madison, Dept of Biochemistry
As far as I can tell BlockMoveData is _only_ significant on the 040.
BlockMove does not flush the cache on the PPC's.
_____________________________________________________________________________
Ken Prehoda kenp@nmrfam.wisc.edu
Department of Biochemistry http://www.nmrfam.wisc.edu
University of Wisconsin-Madison Tel: 608-263-9498
420 Henry Mall Fax: 608-262-3453
+++++++++++++++++++++++++++
>From cameron_esfahani@powertalk.apple.com (Cameron Esfahani)
Date: Wed, 29 Nov 1995 22:53:41 -0800
Organization: Apple Computer, Inc.
> As far as I can tell BlockMoveData is _only_ significant on the 040.
> BlockMove does not flush the cache on the PPC's.
That is not true. BlockMove does flush the cache on the newer PPCs. Any
PPC with a split cache (the 603, 604, and any later ones) will require
cache flushing. So BlockMove on a 601-based machine doesn't flush the
cache, because with a unified cache there is no need to, but on post-601
machines it does flush.
Cameron Esfahani
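The practical rule that falls out of this exchange can be sketched in C,
assuming the classic Mac OS Memory Manager declarations (void
BlockMove(const void *srcPtr, void *destPtr, Size byteCount), and the same
signature for BlockMoveData); the wrapper names below are illustrative only.

#include <Memory.h>   /* <MacMemory.h> in later Universal Interfaces */

/* Data that will never be executed: skip the instruction-cache flush
   that split-cache PowerPCs (603/604) would otherwise pay for. */
static void CopyPixelData(const void *src, void *dst, Size byteCount)
{
    BlockMoveData(src, dst, byteCount);
}

/* Bytes that may later be executed as code: keep BlockMove so the
   instruction cache stays coherent on split-cache machines. */
static void CopyCodeBytes(const void *src, void *dst, Size byteCount)
{
    BlockMove(src, dst, byteCount);
}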
+++++++++++++++++++++++++++
>From mick@emf.net (Mick Foley)
Date: Wed, 29 Nov 1995 22:23:29 -0800
Organization: "emf.net" Quality Internet Access. (510) 704-2929 (Voice)
> As far as I can tell BlockMoveData is _only_ significant on the 040.
> BlockMove does not flush the cache on the PPC's.
Not on the 601 which has a unified cache. But it should make a big
difference on the 603 and 604 which have split data and code caches.
Mick
+++++++++++++++++++++++++++
>From Ed Wynne <arwyn@engin.umich.edu>
Date: 4 Dec 1995 04:09:26 GMT
Organization: Arwyn, Inc.
Actually, that's almost right... BlockMoveData CAN cause cache flushing on
601-based machines if they are running the DR emulator. The processor cache
doesn't get flushed, but the emulator's internal cache of recompiled code
does. This process is probably a fair amount slower than the real on-chip
cache flush, since it is a software-based operation.
To my knowledge the only machines so far with this configuration would be
the 7200 and 7500. (Does the 8500 have a 601 option?)
-ed
---------------------------
C.S.M.P. Digest Tue, 26 Dec 95 Volume 3 : Issue 129
---------------------------
>From steele@isi.edu (Craig S. Steele)
Subject: Block copy on 604 slow
Date: Tue, 5 Dec 1995 18:30:53 -0800
Organization: USC Information Sciences Institute
I'm trying to benchmark block copy rates of various sizes for PowerPCs. My
results are disappointing for the 604, and cause me to wonder what it is I
don't understand. Testing on a 9500/120, to which I have limited access, gives
the following results for copy code using 32-bit integer and 64-bit double
load and stores, respectively:
Asm lvector copy of 1024 doubles in 44.8 nS/acc, 5.4 clocks/acc, 85.1 MB/s
Asm dvector copy of 1024 doubles in 34.1 nS/acc, 4.1 clocks/acc, 111.9 MB/s
The source array is aligned to 4K, the destination array to 4K+0x100, to avoid
possible aliasing interlocks. The source array is preloaded immediately
before the copy routine is called, so I would expect everything to run at L1
cache rates.
I would naively expect the copy code to average about 1.5 clocks per load or
store. Instead, my code reports over 4 clocks/access. The code uses the
time-base register for timing, which shouldn't cause significant cache
disturbance.
Can anyone contradict, corroborate, or explain my poor results? If I can't do
better than this, we'll have to build extra hardware :-(
Thanks in advance.
-Craig
exportf2 dvec_copy
mtctr r5 ; init loop counter
addi r3,r3,-8 ; predecrement pointer by double size
addi r4,r4,-8 ; predecrement pointer by double size
li r6,8 ; cache line alignment constant for dcbz
b dvc_1
align 6
dvc_1
dcbz r6,r3 ; kill dest. cache line
lfd fp0,8(r4)
lfd fp1,16(r4)
lfd fp2,24(r4)
lfdu fp3,32(r4)
stfd fp0,8(r3)
stfd fp1,16(r3)
stfd fp2,24(r3)
stfdu fp3,32(r3)
bdnz dvc_1 ; test loop condition
blr
Craig S. Steele - Not yet Institutionalized
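A minimal C sketch of the buffer setup described above -- source aligned to
a 4 KB boundary, destination at a 4 KB boundary plus 0x100, source touched
before the timed copy so it starts in the L1 cache -- assuming
posix_memalign() is available; this is not part of the original benchmark.

#include <stdlib.h>

enum { N_DOUBLES = 1024 };

int setup_buffers(double **src_out, double **dst_out)
{
    void *src_raw, *dst_raw;
    double *src, *dst;
    int i;

    /* source array aligned to a 4 KB boundary */
    if (posix_memalign(&src_raw, 4096, N_DOUBLES * sizeof(double)) != 0)
        return -1;
    /* destination at 4K + 0x100 to avoid cache-set aliasing with the source */
    if (posix_memalign(&dst_raw, 4096, N_DOUBLES * sizeof(double) + 0x100) != 0)
        return -1;

    src = src_raw;
    dst = (double *)((char *)dst_raw + 0x100);

    /* initialize (and thereby preload) the source into the L1 cache */
    for (i = 0; i < N_DOUBLES; i++)
        src[i] = (double)i;

    *src_out = src;
    *dst_out = dst;
    return 0;
}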
+++++++++++++++++++++++++++
>From rbarris@netcom.com (Robert Barris)
Date: Wed, 6 Dec 1995 09:46:47 GMT
Organization: NETCOM On-line Communication Services (408 261-4700 guest)
In article <9512051830.AA53505@kandor.isi.edu>,
Craig S. Steele <steele@isi.edu> wrote:
>I'm trying to benchmark block copy rates of various sizes for PowerPCs. My
>results are disappointing for the 604, and cause me to wonder what it is I
>don't understand. Testing on a 9500/120, to which I have limited access, gives
>the following results for copy code using 32-bit integer and 64-bit double
>load and stores, respectively:
>
>Asm lvector copy of 1024 doubles in 44.8 nS/acc, 5.4 clocks/acc, 85.1 MB/s
>Asm dvector copy of 1024 doubles in 34.1 nS/acc, 4.1 clocks/acc, 111.9 MB/s
OK, in regular "bytes", you appear to be moving (for example's sake)
8192 bytes
from address (say) 0x1000000
to address (say) 0x1002100.
So you are reading 8K and writing 8K as I read it... in a perfect world
all of your data would fit (precisely) into the L1 d cache.
>The source array is aligned to 4K, the destination array to 4K+0x100, to avoid
>possible aliasing interlocks. The source array is preloaded immediately
>before the copy routine is called, so I would expect everything to run at L1
>cache rates.
Except that you are sharing that L1 with things like interrupt tasks, 68K
interrupt tasks (which invoke the emulator causing additional pollution),
and so on.
Since as far as I know, there is no way to completely shut off PowerPC
interrupts, quantifying the effect of background processes on your cache
population can be a bit tricky.
>I would naively expect the copy code to average about 1.5 clocks per load or
>store. Instead, my code reports over 4 clocks/access. The code uses the
>time-base register for timing, which shouldn't cause significant cache
>disturbance.
When you say per access, do you mean per double "moved" as in a read and
a write, or per double accessed, as in the read or the write alone?
I guess I can work it out: 110MB/s (say it's 120 for arguments sake) is
about 1MB per million clocks (at 120MHz). Or about a byte moved per clock, or
a double moved per 8 clocks. OK so that's 4 per double read, 4 per
double write (on average).
Suggestions:
1. Plot speed versus vector length. Look for nonlinearities.
(deliberately shrink or grow the vector).
2. Wiggle that 256-byte offset factor some more, or make it zero.
I do not think the 4-wayness would become a problem until you went
above 8K vectors; past that, very little would help...
3. Think about cache hinting at or near the bottom of the loop.
if for some reason a cache line which you are going to read from
has been dropped, it's good to schedule its re-fetch as far ahead as
possible. I'm sure Tim Olson can elaborate much more better good :)
4. I hear Exponential Technology has a faster BiCMOS 604 coming...
Rob Barris
Quicksilver Software Inc.
rbarris@quicksilver.com
* opinions expressed not necessarily those of my employer *
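The back-of-envelope arithmetic in the post above, spelled out as a small C
sketch; the 120 MHz clock and 111.9 MB/s figures are taken from the posts,
and treating 1 MB as 10^6 bytes is an assumption made for the rough estimate.

#include <stdio.h>

int main(void)
{
    double mb_per_sec = 111.9;   /* measured copy rate from the benchmark */
    double clock_hz   = 120e6;   /* 9500/120 CPU clock                    */

    double bytes_per_sec  = mb_per_sec * 1e6;
    double clk_per_byte   = clock_hz / bytes_per_sec;   /* ~1.07 */
    double clk_per_double = clk_per_byte * 8.0;         /* read + write of 8 bytes */
    double clk_per_access = clk_per_double / 2.0;       /* ~4.3 per read or write  */

    printf("%.2f clocks/byte, %.2f clocks/double moved, %.2f clocks/access\n",
           clk_per_byte, clk_per_double, clk_per_access);
    return 0;
}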
+++++++++++++++++++++++++++
>From steele@isi.edu (Craig S. Steele)
Date: Wed, 6 Dec 1995 12:41:15 -0800
Organization: USC Information Sciences Institute
In article <rbarrisDJ5sHz.MJy@netcom.com>, rbarris@netcom.com (Robert Barris)
writes:
> In article <9512051830.AA53505@kandor.isi.edu>, Craig S. Steele
> <steele@isi.edu> wrote:
> >I'm trying to benchmark block copy rates of various sizes for
> >PowerPCs. My results are disappointing for the 604, and cause me
> >to wonder what it is I don't understand. Testing on a 9500/120, to
> >which I have limited access, gives the following results for copy
> >code using 32-bit integer and 64-bit double load and stores,
> >respectively:
> >
> >Asm lvector copy of 1024 doubles in 44.8 nS/acc, 5.4 clocks/acc, 85.1 MB/s
> >Asm dvector copy of 1024 doubles in 34.1 nS/acc, 4.1 clocks/acc, 111.9 MB/s
> So you are reading 8K and writing 8K as I read it... in a perfect
> world all of your data would fit (precisely) into the L1 d cache.
Exactly. However, I did benchmark a range of copy sizes from 512B to 1MB; the
quoted 8KB block results were the fastest. Needless to say the rate for
larger blocks dropped precipitously as the sizes busted (burst?) the L1 and L2
caches.
> >...so I would expect everything to run at L1 cache rates.
> Except that you are sharing that L1 with things like interrupt
> tasks, 68K interrupt tasks (which invoke the emulator causing
> additional pollution), and so on.
True. I would have thought that at least some of my trials would have fit in
between interrupts; e.g., the critical part of the 8KB case only takes about
100us, and the smaller ones proportionately less. I also tried back-to-back
copy calls, producing essentially identical results. I did get _much_ worse
results when I experimented with using the MacOS Microseconds call for
timing, so the cache pollution issue is very real. What is the highest-rate
interrupt source on an idle PowerMac anyway? Is Microseconds non-native? I'm
clueless.
> Since as far as I know, there is no way to completely shut off
> PowerPC interrupts, quantifying the effect of background processes on
> your cache population can be a bit tricky.
I believe I know how to do it on an 8100 (although not the 9500) so it's
probably worth a (probable deathcookies) experiment to see if it makes a
difference there. I deeply regret having blown up our only hardware prototype
last month... Maybe next week I'll have a bare machine again, knock on
Formica(TM).
> >I would naively expect the copy code to average about 1.5 clocks per
> >load or store. Instead, my code reports over 4 clocks/access.
> I guess I can work it out ... OK so that's 4 per double
> read, 4 per double write (on average).
Yes.
> Suggestions:
> 1. Plot speed versus vector length. Look for nonlinearities.
> (deliberately shrink or grow the vector).
For a 9500/120:
512B 49 MB/s
1KB 68
2KB 87
4KB 109
8KB 112
16KB 68
32KB 62
64KB 54
128KB 53
256KB 40
512KB 35
1024KB 32
The trends are reasonable, it's just the L1 peak rate that seems very low to
me. The 6100 and 8100, on the other hand, have some huge anomalous dips
for 128KB operations, presumably managing to evict the code from both the L1 &
L2 unified caches in some particularly malign way.
> 2. wiggle that 256 byte offset factor some more. or make it zero.
Zero makes things about 10% slower, but I haven't yet tried other offsets.
> 3. think about cache hinting at or near the bottom of the loop.
> if for some reason a cache line which you are going to read from
> has been dropped, it's good to schedule its re-fetch as far ahead
> as possible.
A prior load loop is supposed to have ensured that the source is in the cache,
but this is a good suggestion to double check that assumption, and probably
the right thing to do for a general-purpose copy where cache status is
uncontrolled. I'll check this out.
> 4. I hear Exponential Technology has a faster BiCMOS 604 coming...
That certainly does look interesting, "only" $14 million capitalization, but
good credentials. Unfortunately, I have to put something under the tree for
this Christmas, can't wait for that rosy glow ("Is it Rudolph or is it
bipolar?") we might see next. :-)
Craig S. Steele - Not yet Institutionalized
+++++++++++++++++++++++++++
>From tim@apple.com (Tim Olson)
Date: 7 Dec 1995 03:33:26 GMT
Organization: Apple Computer, Inc. / Somerset
In article <9512051830.AA53505@kandor.isi.edu>
steele@isi.edu (Craig S. Steele) writes:
> I would naively expect the copy code to average about 1.5 clocks per load or
> store. Instead, my code reports over 4 clocks/access. The code uses the
> time-base register for timing, which shouldn't cause significant cache
> disturbance.
>
> Can anyone contradict, corroborate, or explain my poor results?
I did a number of measurements a while back which showed that a 604 can
perform the loop you gave (without the DCBZ) at about 1.3 cycles per
doubleword loaded or stored -- this was done by measuring the runtime
of copying a 64-byte block over many iterations, so both source and
destination were in the cache. The DCBZ instruction spends multiple
cycles clearing the allocated cache block, so that will add some
overhead (I don't have my spec with me -- I seem to remember it is 4
cycles), which should bring it to somewhere around 15 cycles per loop
iteration, or about 1.8 cycles per doubleword, which is still far less
than your reported 4 cycles.
First, try running without the DCBZ to see if it more closely matches
my results (~1.3 cycles per doubleword); if not, then you might be
forgetting about some multiplication factor when using the timebase
register. On the 604, it increments every 4th bus clock.
-- Tim Olson
Apple Computer, Inc. / Somerset
tim@apple.com
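The timebase-to-cycles conversion Tim describes can be sketched as follows:
on the 604 the timebase increments once every 4 bus clocks, so the factor to
apply is 4 * (CPU clock / bus clock). The 120 MHz CPU / 40 MHz bus figures
used for the example are assumptions about the 9500/120, not from the post.

#include <stdio.h>

int main(void)
{
    double cpu_hz = 120e6;   /* assumed CPU clock            */
    double bus_hz =  40e6;   /* assumed bus clock            */
    double ticks  = 1000.0;  /* elapsed timebase ticks read  */

    double bus_clocks = ticks * 4.0;                    /* 604: 1 tick = 4 bus clocks */
    double cpu_cycles = bus_clocks * (cpu_hz / bus_hz); /* scale up to CPU cycles     */

    printf("%.0f timebase ticks = %.0f bus clocks = %.0f CPU cycles\n",
           ticks, bus_clocks, cpu_cycles);
    return 0;
}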
+++++++++++++++++++++++++++
>From cliffc@ami.sps.mot.com (Cliff Click)
Date: 7 Dec 95 09:23:08
Organization: none
steele@isi.edu (Craig S. Steele) writes:
Craig S. Steele <steele@isi.edu> wrote:
>I'm trying to benchmark block copy rates of various sizes for
>PowerPCs. My results are disappointing for the 604, and cause me
>to wonder what it is I don't understand. Testing on a 9500/120, to
>which I have limited access, gives the following results for copy
>code using 32-bit integer and 64-bit double load and stores,
>respectively:
Have you tried using "lmw" and "stmw" instead of "lfd" and "stfd"?
My 604 book sez these are #regs+2 cycles each, whilst the float
operations are 3 cycles each. For large enough blocks, you should
win on the lmw and stmw.
Cliff
--
Cliff Click Compiler Researcher & Designer
RISC Software, Motorola PowerPC Compilers
cliffc@risc.sps.mot.com (512) 891-7240
+++++++++++++++++++++++++++
>From tim@apple.com (Tim Olson)
Date: 8 Dec 1995 02:59:57 GMT
Organization: Apple Computer, Inc. / Somerset
In article <CLIFFC.95Dec7092308@ami.sps.mot.com>
cliffc@ami.sps.mot.com (Cliff Click) writes:
> Have you tried using "lmw" and "stmw" instead of "lfd" and "stfd"?
> My 604 book sez these are #regs+2 cycles each, whilst the float
> operations are 3 cycles each. For large enough blocks, you should
> win on the lmw and stmw.
The lfd instruction has a 3-cycle latency for using the result of the
load in a floating-point operation, but the issue-rate of lfd is one
per cycle. When pipelined in the manner used in the block copy code,
it can transfer at close to one doubleword per cycle.
Load and store multiple instructions can achieve close to one word per
cycle for large transfers, but that is half the bandwidth of the
lfd/stfd solution.
-- Tim Olson
Apple Computer, Inc. / Somerset
tim@apple.com
+++++++++++++++++++++++++++
>From Mark Williams <Mark@streetly.demon.co.uk>
Date: Thu, 07 Dec 95 18:25:26 GMT
Organization: Streetly Software
In article <CLIFFC.95Dec7092308@ami.sps.mot.com>, Cliff Click writes:
>
> steele@isi.edu (Craig S. Steele) writes:
>
> Craig S. Steele <steele@isi.edu> wrote:
> >I'm trying to benchmark block copy rates of various sizes for
> >PowerPCs. My results are disappointing for the 604, and cause me
> >to wonder what it is I don't understand. Testing on a 9500/120, to
> >which I have limited access, gives the following results for copy
> >code using 32-bit integer and 64-bit double load and stores,
> >respectively:
>
> Have you tried using "lmw" and "stmw" instead of "lfd" and "stfd"?
> My 604 book sez these are #regs+2 cycles each, whilst the float
> operations are 3 cycles each. For large enough blocks, you should
> win on the lmw and stmw.
>
> Cliff
> --
> Cliff Click Compiler Researcher & Designer
> RISC Software, Motorola PowerPC Compilers
> cliffc@risc.sps.mot.com (512) 891-7240
But surely the point is that lfd & stfd have a _latency_ of 3 cycles, but a
throughput of 1 instruction per cycle, whereas the lmw/stmw have both a latency
and throughput of 1 instruction per #regs+2 cycles. That means the lfd/stfd
method should be able to move (ie load and store) 1 word per cycle, while the
lmw/stmw cannot do better than 1 word every 2 cycles (and even with 28 regs
available it would take 60 cycles to move 28 words).
- --------------------------------------
Mark Williams<Mark@streetly.demon.co.uk>
+++++++++++++++++++++++++++
>From tjrob@bluebird.flw.att.com (Tom Roberts)
Date: Sat, 9 Dec 1995 19:19:22 GMT
Organization: AT&T Bell Laboratories
In article <4a89nd$hrp@cerberus.ibmoto.com>, Tim Olson <tim@apple.com> wrote:
>In article <CLIFFC.95Dec7092308@ami.sps.mot.com>
>cliffc@ami.sps.mot.com (Cliff Click) writes:
>
>> Have you tried using "lmw" and "stmw" instead of "lfd" and "stfd"?
>> My 604 book sez these are #regs+2 cycles each, whilst the float
>> operations are 3 cycles each. For large enough blocks, you should
>> win on the lmw and stmw.
>
>The lfd instruction has a 3-cycle latency for using the result of the
>load in a floating-point operation, but the issue-rate of lfd is one
>per cycle. When pipelined in the manner used in the block copy code,
>it can transfer at close to one doubleword per cycle.
>
>Load and store multiple instructions can achieve close to one word per
>cycle for large transfers, but that is half the bandwith of the
>lfd/stfd solution.
In practical systems, memory bandwidth is MUCH more important than
the number of instructions used or their throughput or latency.
(This assumes that the data actually resides in memory, not just in the
cache. This also assumes a "long" loop, so the code is in the icache.)
In systems which run the 604 at 1:1 clocking (i.e. internal CPU clock
equals external bus clock), memory bandwidth can be 2-4 times slower than
simple calculations would suggest. This is due to cache-access limitations
and the fact that both the CPU and the bus access unit are competing for
access to the cache. In this mode the memory essentially NEVER overlaps
address and data tenures on the bus (halving memory bandwidth); there are
usually several bus clocks between successive cycles, reducing bandwidth
even more.
With 1.5:1 clocking this effect is reduced -- the cache can handle
one access per internal clock, so there is a cycle available to the
CPU between every 2 bus accesses. At 2:1 this effect should disappear,
as the CPU can get every other cycle, and keep up with the memory
bus bandwidth.
Note that only recently have 604 chips been shipping which can go 1.5:1
at 66 MHz bus clock.
Tom Roberts tjrob@iexist.att.com
---------------------------